14 research outputs found

    Guarded Policy Optimization with Imperfect Online Demonstrations

    Full text link
    The Teacher-Student Framework (TSF) is a reinforcement learning setting in which a teacher agent guards the training of a student agent by intervening and providing online demonstrations. When assumed to be optimal, the teacher policy has the perfect timing and capability to intervene in the student's learning process, providing safety guarantees and exploration guidance. Nevertheless, in many real-world settings it is expensive or even impossible to obtain a well-performing teacher policy. In this work, we relax the assumption of a well-performing teacher and develop a new method that can incorporate arbitrary teacher policies with modest or inferior performance. We instantiate an off-policy reinforcement learning algorithm, termed Teacher-Student Shared Control (TS2C), which incorporates teacher intervention based on trajectory-based value estimation. Theoretical analysis shows that the proposed TS2C algorithm attains efficient exploration and a substantial safety guarantee without being affected by the teacher's own performance. Experiments on various continuous control tasks show that our method can exploit teacher policies at different performance levels while maintaining a low training cost. Moreover, the student policy surpasses the imperfect teacher policy, achieving higher accumulated reward in held-out testing environments. Code is available at https://metadriverse.github.io/TS2C
    Comment: Accepted at ICLR 2023 (top 25%)
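
    As a rough illustration of how value-gated intervention could look, the sketch below switches control to the teacher only when its estimated value exceeds the student's by a margin. The names `student_policy`, `teacher_policy`, `value_fn`, and the threshold `epsilon` are hypothetical stand-ins; this is not the paper's actual TS2C implementation.

```python
def shared_control_step(env, obs, student_policy, teacher_policy, value_fn, epsilon=0.1):
    """One environment step under value-gated teacher intervention (illustrative only).

    value_fn(obs, action) is assumed to estimate the return of taking `action`
    in `obs` and following the teacher afterwards. The teacher takes over only
    when the student's proposal looks worse by more than `epsilon`.
    """
    a_student = student_policy(obs)
    a_teacher = teacher_policy(obs)

    # Gate the intervention on estimated values rather than on action mismatch,
    # so a mediocre teacher does not override student actions that are already
    # good enough.
    intervened = value_fn(obs, a_teacher) - value_fn(obs, a_student) > epsilon
    action = a_teacher if intervened else a_student

    next_obs, reward, done, info = env.step(action)  # gym-style env assumed
    return next_obs, reward, done, intervened
```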

    MetaDrive: Composing Diverse Driving Scenarios for Generalizable Reinforcement Learning

    Full text link
    Driving safely requires multiple capabilities from human and intelligent agents, such as generalizability to unseen environments, safety awareness of the surrounding traffic, and decision-making in complex multi-agent settings. Despite the great success of Reinforcement Learning (RL), most RL research investigates each capability separately due to the lack of integrated environments. In this work, we develop a new driving simulation platform called MetaDrive to support research on generalizable reinforcement learning algorithms for machine autonomy. MetaDrive is highly compositional and can generate an infinite number of diverse driving scenarios from both procedural generation and real data import. Based on MetaDrive, we construct a variety of RL tasks and baselines in both single-agent and multi-agent settings, including benchmarking generalizability across unseen scenes, safe exploration, and learning in multi-agent traffic. Generalization experiments conducted on both procedurally generated and real-world scenarios show that increasing the diversity and size of the training set improves the generalizability of RL agents. We further evaluate various safe reinforcement learning and multi-agent reinforcement learning algorithms in MetaDrive environments and provide the benchmarks. Source code, documentation, and demo video are available at https://metadriverse.github.io/metadrive . More research projects based on the MetaDrive simulator are listed at https://metadriverse.github.io
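
    For readers unfamiliar with the workflow the abstract describes, a minimal gym-style rollout over procedurally generated scenarios might look like the sketch below. It assumes the `metadrive` package is installed and follows the gymnasium-style reset/step convention of recent releases; the config keys shown are illustrative and vary across versions, so check the project documentation.

```python
# Minimal sketch of a MetaDrive rollout, assuming the gymnasium-style API of
# recent releases (reset() -> (obs, info), step() -> 5-tuple). Config keys are
# placeholders; consult the MetaDrive docs for the names your version supports.
from metadrive import MetaDriveEnv

env = MetaDriveEnv(config={
    "num_scenarios": 100,    # how many procedurally generated maps to sample from
    "start_seed": 0,         # seed offset selecting which scenarios are generated
    "traffic_density": 0.1,  # density of surrounding traffic vehicles
})

obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random policy standing in for an RL agent
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```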

    PrefRec: Recommender Systems with Human Preferences for Reinforcing Long-term User Engagement

    Full text link
    Current advances in recommender systems have been remarkably successful in optimizing immediate engagement. However, long-term user engagement, a more desirable performance metric, remains difficult to improve. Meanwhile, recent reinforcement learning (RL) algorithms have shown their effectiveness in a variety of long-term goal optimization tasks. For this reason, RL is widely regarded as a promising framework for optimizing long-term user engagement in recommendation. Though promising, the application of RL heavily relies on well-designed rewards, and designing rewards related to long-term user engagement is quite difficult. To mitigate this problem, we propose a novel paradigm, recommender systems with human preferences (or Preference-based Recommender systems, PrefRec), which allows RL recommender systems to learn from preferences over users' historical behaviors rather than from explicitly defined rewards. Such preferences are easily accessible through techniques such as crowdsourcing, as they do not require any expert knowledge. With PrefRec, we can fully exploit the advantages of RL in optimizing long-term goals while avoiding complex reward engineering. PrefRec uses the preferences to automatically train a reward function in an end-to-end manner. The reward function is then used to generate learning signals to train the recommendation policy. Furthermore, we design an effective optimization method for PrefRec, which uses an additional value function, expectile regression, and reward model pre-training to improve performance. We conduct experiments on a variety of long-term user engagement optimization tasks. The results show that PrefRec significantly outperforms previous state-of-the-art methods on all the tasks.
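
    The abstract states that PrefRec trains a reward function end-to-end from preferences over users' historical behaviors. A common recipe for this in preference-based RL is a Bradley-Terry loss over pairs of trajectory segments; the sketch below shows that generic recipe in PyTorch and is not PrefRec's actual architecture or objective.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (state, action) pair to a scalar reward estimate."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_model, seg_a, seg_b, prefer_a):
    """Bradley-Terry loss: the preferred segment should get a higher summed reward.

    seg_a, seg_b: tuples (obs, act), each of shape [batch, T, dim]
    prefer_a:     float tensor in {0, 1}, 1 if segment A was preferred
    """
    r_a = reward_model(*seg_a).sum(dim=-1)   # return estimate for segment A
    r_b = reward_model(*seg_b).sum(dim=-1)   # return estimate for segment B
    logits = r_a - r_b
    return nn.functional.binary_cross_entropy_with_logits(logits, prefer_a)
```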

    State Regularized Policy Optimization on Data with Dynamics Shift

    Full text link
    In many real-world scenarios, Reinforcement Learning (RL) algorithms are trained on data with dynamics shift, i.e., with different underlying environment dynamics. A majority of current methods address this issue by training context encoders to identify environment parameters. Data with dynamics shift are separated according to their environment parameters to train the corresponding policy. However, these methods can be sample inefficient, as data are used ad hoc and policies trained for one set of dynamics cannot benefit from data collected in other environments with different dynamics. In this paper, we find that in many environments with similar structures but different dynamics, optimal policies have similar stationary state distributions. We exploit this property and learn the stationary state distribution from data with dynamics shift for efficient data reuse. This distribution is used to regularize the policy trained in a new environment, leading to the SRPO (State Regularized Policy Optimization) algorithm. For the theoretical analysis, the intuition of similar environment structures is formalized by the notion of homomorphous MDPs. We then derive a lower-bound performance guarantee for policies regularized by the stationary state distribution. In practice, SRPO can serve as an add-on module to context-based algorithms in both online and offline RL settings. Experimental results show that SRPO makes several context-based algorithms far more data efficient and significantly improves their overall performance.
    Comment: Preprint. Under Review
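
    One generic way to realize the state-distribution regularizer described above is to penalize visited states that a learned stationary-state density model scores as unlikely. The sketch below illustrates this idea with the density model treated as a black box; `log_density_fn` and the weight `alpha` are hypothetical, and this is not the paper's exact objective.

```python
def srpo_style_loss(policy_loss, states, log_density_fn, alpha=0.1):
    """Augment an arbitrary policy loss with a stationary-state-distribution penalty.

    policy_loss:    scalar loss from the base RL algorithm (e.g. SAC or PPO)
    states:         batch of states visited by the current policy
    log_density_fn: callable returning log-probabilities of states under the
                    stationary state distribution estimated from data gathered
                    in other (structurally similar) dynamics
    alpha:          trade-off between the RL objective and the regularizer
    """
    # Reward the policy for visiting states that were common under near-optimal
    # behavior in environments with similar structure but different dynamics.
    return policy_loss - alpha * log_density_fn(states).mean()
```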

    AdaRec: Adaptive Sequential Recommendation for Reinforcing Long-term User Engagement

    Full text link
    Growing attention has been paid to Reinforcement Learning (RL) algorithms for optimizing long-term user engagement in sequential recommendation tasks. One challenge in large-scale online recommendation systems is the constant and complicated change in users' behavior patterns, such as interaction rates and retention tendencies. When the problem is formulated as a Markov Decision Process (MDP), the dynamics and reward functions of the recommendation system are continuously affected by these changes. Existing RL algorithms for recommendation systems suffer from distribution shift and struggle to adapt in such an MDP. In this paper, we introduce a novel paradigm called Adaptive Sequential Recommendation (AdaRec) to address this issue. AdaRec proposes a new distance-based representation loss to extract latent information from users' interaction trajectories. This information reflects how well the RL policy fits current user behavior patterns and helps the policy identify subtle changes in the recommendation system. To adapt rapidly to these changes, AdaRec encourages exploration with the idea of optimism under uncertainty. The exploration is further guarded by zero-order action optimization to ensure stable recommendation quality in complicated environments. We conduct extensive empirical analyses in both simulator-based and live sequential recommendation tasks, where AdaRec exhibits superior long-term performance compared to all baseline algorithms.
    Comment: Preprint. Under Review
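
    The zero-order action optimization mentioned above can be pictured as sampling small perturbations around the policy's proposed action and keeping the candidate a critic scores highest, so exploration never strays far from a known-good recommendation. The sketch below shows that generic idea; the function and its parameters are illustrative, not AdaRec's actual procedure.

```python
import torch

def zero_order_action_search(critic, obs, base_action, radius=0.05, num_samples=16):
    """Pick the best action within a small ball around the policy's proposal.

    critic(obs, action) -> value estimates of shape [batch]
    obs:         [batch, obs_dim], base_action: [batch, act_dim]
    radius:      maximum perturbation, keeping recommendations close to the
                 current policy so quality stays stable while exploring
    """
    batch, act_dim = base_action.shape
    # Candidate set: the base action plus random perturbations inside the ball.
    noise = torch.empty(num_samples, batch, act_dim).uniform_(-radius, radius)
    candidates = torch.cat(
        [base_action.unsqueeze(0), base_action.unsqueeze(0) + noise], dim=0
    )                                                  # [num_samples + 1, batch, act_dim]

    obs_rep = obs.unsqueeze(0).expand(num_samples + 1, *obs.shape)
    values = critic(obs_rep, candidates)               # [num_samples + 1, batch]
    best = values.argmax(dim=0)                        # best candidate index per batch item
    return candidates[best, torch.arange(batch)]       # [batch, act_dim]
```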

    Gene Expression Profiles Deciphering Rice Phenotypic Variation between Nipponbare (Japonica) and 93-11 (Indica) during Oxidative Stress

    Get PDF
    Rice is a very important food staple that feeds more than half of the world's population. The two major Asian cultivated rice (Oryza sativa L.) subspecies, japonica and indica, show significant phenotypic variation in their stress responses, but the molecular mechanisms underlying this variation are still largely unknown. A common link among different stresses is that they produce an oxidative burst and result in an increase of reactive oxygen species (ROS). In this study, methyl viologen (MV) was applied as a ROS agent to investigate the rice oxidative stress response. We observed that 93-11 (indica) seedlings exhibited leaf senescence with severe lesions under MV treatment compared to Nipponbare (japonica). Whole-genome microarray experiments were conducted, and 1,062 probe sets were identified that showed both gene expression level polymorphisms between the two rice cultivars and differential expression under MV treatment; these were designated Core Intersectional Probesets (CIPs). The CIPs were analyzed by gene ontology (GO) and showed enrichment for GO terms related to toxin and oxidative stress responses as well as other responses. These GO term-enriched genes of the CIPs include glutathione S-transferases (GSTs), P450s, plant defense genes, and secondary metabolism related genes such as chalcone synthase (CHS). Further insertion/deletion (InDel) and regulatory element analyses for the identified CIPs suggested that there may be eQTL hotspots related to oxidative stress in the rice genome, such as the GST genes encoded on chromosome 10. In addition, we identified a group of marker genes that distinguish the japonica and indica subspecies. In summary, we developed a new strategy combining biological experiments and data mining to study the possible molecular mechanisms of phenotypic variation during oxidative stress between Nipponbare and 93-11. This study will aid in the analysis of the molecular basis of quantitative traits.
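
    The data-mining strategy described above amounts to intersecting two probe-set lists (those differentially expressed under MV treatment and those showing expression-level polymorphism between the cultivars) and testing the intersection for GO-term enrichment. A minimal sketch of that generic recipe is given below; the column names, thresholds, and hypergeometric test are illustrative assumptions, not the study's actual pipeline.

```python
import pandas as pd
from scipy.stats import hypergeom

def core_intersectional_probesets(mv_de, cultivar_elp):
    """Intersect MV-responsive probe sets with cultivar expression-level polymorphisms.

    mv_de, cultivar_elp: DataFrames indexed by probe-set ID with an 'adj_p' column
    (illustrative layout; real microarray pipelines differ).
    """
    de_set = set(mv_de.index[mv_de["adj_p"] < 0.05])
    elp_set = set(cultivar_elp.index[cultivar_elp["adj_p"] < 0.05])
    return de_set & elp_set  # the "CIP" analogue: both responsive and polymorphic

def go_enrichment_p(cips, go_term_genes, background):
    """Hypergeometric p-value for over-representation of one GO term among the CIPs."""
    k = len(cips & go_term_genes)          # CIPs annotated with the term
    M = len(background)                    # all probe sets on the array
    n = len(go_term_genes & background)    # background probe sets with the term
    N = len(cips)                          # number of CIPs drawn
    return hypergeom.sf(k - 1, M, n, N)    # P(X >= k)
```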

    Reliability and validity of the international dementia alliance schedule for the assessment and staging of care in China

    No full text
    Background Clinical and social services are both important for dementia care. The International Dementia Alliance (IDEAL) Schedule for the Assessment and Staging of Care was developed to guide clinical and social care for dementia. Our study aimed to assess the validity and reliability of the IDEAL schedule in China. Methods Two hundred eighty-two dementia patients and their caregivers were recruited from 15 hospitals in China. Each patient-caregiver dyad was assessed with the IDEAL schedule by a rater and an observer simultaneously. The Clinical Dementia Rating (CDR), Mini-Mental Status Examination (MMSE), and Caregiver Burden Inventory (CBI) were administered to assess criterion validity. A repeat IDEAL assessment was conducted 7-10 days after the initial interview for 62 dyads. Results Two hundred seventy-seven patient-caregiver dyads completed the IDEAL assessment. Inter-rater reliability for the total score of the IDEAL schedule was 0.93 (95% CI = 0.92-0.95). The inter-class coefficient for the total score of IDEAL was 0.95 for the interviewers and 0.93 for the silent raters. The IDEAL total score correlated with the global CDR score (ρ = 0.72, p < 0.001), the CDR sum of boxes (CDR-SOB; ρ = 0.74, p < 0.001), the MMSE total score (ρ = −0.65, p < 0.001), and the CBI (ρ = 0.70, p < 0.001). All item scores of the IDEAL schedule were associated with the CDR-SOB (ρ = 0.17-0.79, all p < 0.05). Conclusion The IDEAL schedule is a valid and reliable tool for the staging of care for dementia in the Chinese population.
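
    To reproduce the kinds of reliability and validity statistics reported above on one's own data, one could compute Spearman correlations with SciPy and an intraclass correlation with the pingouin package, as in the sketch below. The data frames and their values are hypothetical placeholders, not the study's dataset.

```python
import pandas as pd
import pingouin as pg
from scipy.stats import spearmanr

# Hypothetical long-format ratings table: one row per (dyad, rater) pair.
ratings = pd.DataFrame({
    "dyad":  [1, 1, 2, 2, 3, 3],
    "rater": ["interviewer", "observer"] * 3,
    "ideal_total": [12, 13, 20, 19, 8, 9],
})

# Inter-rater reliability of the IDEAL total score (intraclass correlation).
icc = pg.intraclass_corr(data=ratings, targets="dyad", raters="rater", ratings="ideal_total")
print(icc[["Type", "ICC", "CI95%"]])

# Criterion validity: Spearman correlation between IDEAL totals and CDR-SOB,
# using hypothetical per-dyad scores.
scores = pd.DataFrame({
    "ideal_total": [12, 20, 8, 15, 18],
    "cdr_sob":     [3.0, 9.5, 1.5, 6.0, 8.0],
})
rho, p = spearmanr(scores["ideal_total"], scores["cdr_sob"])
print(f"rho = {rho:.2f}, p = {p:.3g}")
```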